Refactor/preprocessing/text normalization by ZenithClown · Pull Request #21 · sharkutilities/NLPurify

ZenithClown · 2025-10-23T10:06:22Z

📜 Description

This PR brings the following change(s):

🐛 Bug Fix - a non-breaking change that fixes an issue.
🌟 New Feature - a non-breaking change that adds new functionality.
🛠️ Breaking Change - can either be a fix or a feature that might change the functionality to not work as expected.
📖 Documentation Change - changes or updates the documentation of a particular project, function, etc.

Fixes # (issue number)

✔️ Checks

Tests are added and passed for all relevant functions (if any).
My code follows the style guidelines as mentioned.
In case of new functionality, proper comments have been added to the code.
Changes do not generate any warnings for existing functions.
Add a new entry in documentation, if fixing a bug or added a new feature.

- grouping of similar functions into submodule is the logical approach
- normalization of text is part of pre-processing of raw texts

- new version uses the modular approach with .apply() method to clean texts of white space
- all the keyword arguments are internally processed to make the model and run the underlying function
- 📃 updated documentation of the model and function
- 💣 removed deprecated function strip_whitespace() from method
- 💣 updated init-time optimization from the module
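A minimal sketch of the pattern described above, for readers who have not opened the diff. It is not the code from this PR: the names `Normalizer`, `WhiteSpace`, `normalize_whitespace`, and the `strip`/`collapse` options are hypothetical, and the standard-library `dataclasses` module stands in for pydantic so the example runs without extra dependencies.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Normalizer(ABC):
    """Hypothetical base class: every normalizer exposes .apply()."""

    @abstractmethod
    def apply(self, text: str) -> str:
        """Return the normalized form of ``text``."""


@dataclass(frozen=True)
class WhiteSpace(Normalizer):
    """Hypothetical model holding the white space cleaning options."""

    strip: bool = True      # remove leading/trailing white space
    collapse: bool = True   # squeeze runs of white space into one space

    def apply(self, text: str) -> str:
        if self.collapse:
            text = re.sub(r"\s+", " ", text)
        if self.strip:
            text = text.strip()
        return text


def normalize_whitespace(text: str, **kwargs) -> str:
    """Build the model from keyword arguments, then run .apply()."""
    # Unknown keyword arguments raise TypeError when the model is built.
    return WhiteSpace(**kwargs).apply(text)


print(normalize_whitespace("  hello \n\t world  "))  # hello world
```

The function stays a thin wrapper: all options live on the model, so adding a new option only touches the class, and callers can use either the function or the model's `.apply()` directly.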

- 🚧 documentation and field validation pending

- added extra words options to be removed, fixes #13
- word tokenization and stop words removal are now in one modular method
- 💣 this deprecates internal nlpurify/feature/selection/nltk.py methods
- added attribute control to check stop words with desired case folding (upper/lower) as per final string's case folding requirements
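The following sketch illustrates one reading of the changes listed above: tokenization and stop word removal in a single step, user-supplied extra words, and stop words folded to the case of the already-normalized text. Everything here is an assumption for illustration: the class name `StopWordsRemover`, its `extra_words` and `case` attributes, the tiny built-in stop word set, and the regex tokenizer are hypothetical and do not reproduce the PR's NLTK-based implementation.

```python
import re
from dataclasses import dataclass

# Tiny illustrative stop word set; a real list would be much larger.
DEFAULT_STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "and"})


@dataclass(frozen=True)
class StopWordsRemover:
    """Hypothetical model: tokenize and drop stop words in one step."""

    extra_words: tuple = ()   # additional words to remove
    case: str = "lower"       # case folding of the incoming text

    def __post_init__(self):
        if self.case not in ("lower", "upper"):
            raise ValueError("case must be 'lower' or 'upper'")

    def _fold(self, word: str) -> str:
        return word.upper() if self.case == "upper" else word.lower()

    def apply(self, text: str) -> str:
        # Fold the stop words to match the text's case folding, so an
        # upper-cased text is compared against upper-cased stop words.
        stop_words = {
            self._fold(w) for w in DEFAULT_STOP_WORDS.union(self.extra_words)
        }
        tokens = re.findall(r"\w+", text)  # simple word tokenization
        return " ".join(t for t in tokens if t not in stop_words)


remover = StopWordsRemover(extra_words=("fox",))
print(remover.apply("the quick fox is a legend"))              # quick legend
print(StopWordsRemover(case="upper").apply("THE QUICK FOX"))   # QUICK FOX
```

Folding the stop words rather than the text keeps the output in whatever case the earlier normalization step produced, which is the motivation the last bullet appears to describe.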

- added example in jupyter notebooks
- added preprocessing utility methods
- modularize word tokenization in stop words selection

ZenithClown added 7 commits October 22, 2025 15:26

💣 normalization is now part of nlp-utils/preprocessing module

4986a86

- grouping of similar functions into submodule is the logical approach
- normalization of text is part of pre-processing of raw texts

🩹🚧 patching normalization process with pydantic and abc

a0975d3

✨ added case folding normalization technique

74fdcda

- 🚧 documentation and field validation pending

💣 refactor feature/selection section

de50c94

✨💣 refactor word tokenization in nlpurify.preprocessing.utils

88d5b8f

- added example in jupyter notebooks
- added preprocessing utility methods
- modularize word tokenization in stop words selection

ZenithClown merged commit 7df63f0 into master Oct 23, 2025
5 of 6 checks passed

ZenithClown deleted the refactor/preprocessing/text-normalization branch October 23, 2025 10:06
